
Retrying applying CRs in the e2e jobs. #1293

Merged

Conversation

ybettan
Collaborator

@ybettan ybettan commented Jan 12, 2025

The webhook services are not always available right after their deployments become ready. This causes a race condition between the services becoming ready to receive requests and the CRs being applied to the cluster.

By retrying the CR application, we give the services the time they need to become ready.
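
For context, the fix reduces to a single retry loop per script; the e2e-hub variant, as it appears in the diffs reviewed below:

# Retry `oc apply` until it succeeds, giving up after 1 minute; the
# 3-second sleep gives the webhook services time to come up between attempts.
timeout 1m bash -c 'until oc apply -k ci/e2e-hub; do sleep 3; done'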


/assign @yevgeny-shnaidman @TomerNewman
Fixes #1291

Summary by CodeRabbit

  • Chores
    • Enhanced CI/CD scripts with improved error handling and timeout mechanisms for Kubernetes resource deployment.
    • Added retry logic to oc apply commands to handle potential transient failures.
    • Implemented timeout mechanisms for pod creation and resource application processes.


openshift-ci bot commented Jan 12, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ybettan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Jan 12, 2025

Walkthrough

The pull request involves modifications to two Bash scripts used in Kubernetes deployment and build processes. The primary changes introduce a retry mechanism for applying resources and creating pods. Both scripts (ci/prow/e2e-hub-spoke-incluster-build and ci/prow/e2e-incluster-build) now include timeout and retry logic using bash commands that will attempt to apply resources or wait for pod creation multiple times with short sleep intervals, improving error handling and resilience during deployment.

Changes

  • ci/prow/e2e-hub-spoke-incluster-build: added a retry mechanism for oc apply -k ci/e2e-hub with a 1-minute timeout and 3-second sleep intervals
  • ci/prow/e2e-incluster-build: added retry mechanisms for resource application and pod creation with 1-minute timeouts and sleep intervals

Assessment against linked issues

  • Cherry-pick commit a0f0d37bcb9b0a1a38bba1096b89aca92d42ce4e
  • Resolve race condition with webhook services

The changes directly address the issue by implementing a retry mechanism that allows services more time to become ready before applying Custom Resources (CRs).




netlify bot commented Jan 12, 2025

Deploy Preview for openshift-kmm ready!

  • 🔨 Latest commit: 8156f88
  • 🔍 Latest deploy log: https://app.netlify.com/sites/openshift-kmm/deploys/6783b16f679dd90008663324
  • 😎 Deploy Preview: https://deploy-preview-1293--openshift-kmm.netlify.app


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
ci/prow/e2e-incluster-build (2)

45-45: Add error handling for timeout

The script should handle cases where the timeout is reached. Currently, if the resources can't be applied within 1 minute, the script will fail without a clear error message.

-timeout 1m bash -c 'until oc apply -k ci/e2e; do sleep 3; done'
+if ! timeout 1m bash -c 'until oc apply -k ci/e2e; do sleep 3; done'; then
+  echo "ERROR: Failed to apply resources after 1 minute of retries"
+  exit 1
+fi

Line range hint 48-51: Simplify and improve pod name matching

The current jq queries for matching pod names could be simplified and made more robust by:

  1. Using explicit label selectors with oc get pods
  2. Simplifying the jq query
-timeout 1m bash -c 'until oc get pods -o json | jq -er ".items[].metadata.name | select(.? | match(\"build\"))"; do sleep 1; done'
-POD_NAME=$(oc get pods -o json | jq -r '.items[].metadata.name | select(.? | match("build"))')
+timeout 1m bash -c 'until oc get pods -l job-name=kmm-ci-build -o name; do sleep 1; done'
+POD_NAME=$(oc get pods -l job-name=kmm-ci-build -o name | sed 's|pod/||')

-timeout 1m bash -c 'until oc get pods -o json | jq -er ".items[].metadata.name | select(.? | match(\"sign\"))"; do sleep 1; done'
-POD_NAME=$(oc get pods -o json | jq -r '.items[].metadata.name | select(.? | match("sign"))')
+timeout 1m bash -c 'until oc get pods -l job-name=kmm-ci-sign -o name; do sleep 1; done'
+POD_NAME=$(oc get pods -l job-name=kmm-ci-sign -o name | sed 's|pod/||')

Also applies to: 54-57

ci/prow/e2e-hub-spoke-incluster-build (3)

50-50: Document timeout values

The script uses various timeout values (1m, 3m, 15m) for different operations. Consider adding comments explaining the rationale behind these timeout values to help maintainers understand and adjust them if needed.

+# Retry applying resources for up to 1 minute to handle webhook service availability
 timeout 1m bash -c 'until oc apply -k ci/e2e-hub; do sleep 3; done'

Line range hint 52-54: Standardize build wait patterns

The build wait patterns could be standardized with the pod wait patterns from the other script for consistency.

-timeout 1m bash -c 'until oc -n ${HUB_OPERATOR_NAMESPACE} get builds -o json | jq -er ".items[].metadata.name | select(.? | match(\"build\"))"; do sleep 1; done'
-export build_build=$(oc -n ${HUB_OPERATOR_NAMESPACE} get builds -o json | jq -r '.items[].metadata.name | select(.? | match("build"))')
+timeout 1m bash -c 'until oc -n ${HUB_OPERATOR_NAMESPACE} get builds -l job=kmm-ci-build -o name; do sleep 1; done'
+export build_build=$(oc -n ${HUB_OPERATOR_NAMESPACE} get builds -l job=kmm-ci-build -o name | sed 's|build/||')

-timeout 1m bash -c 'until oc -n ${HUB_OPERATOR_NAMESPACE} get builds -o json | jq -er ".items[].metadata.name | select(.? | match(\"sign\"))"; do sleep 1; done'
-export sign_build=$(oc -n ${HUB_OPERATOR_NAMESPACE} get builds -o json | jq -r '.items[].metadata.name | select(.? | match("sign"))')
+timeout 1m bash -c 'until oc -n ${HUB_OPERATOR_NAMESPACE} get builds -l job=kmm-ci-sign -o name; do sleep 1; done'
+export sign_build=$(oc -n ${HUB_OPERATOR_NAMESPACE} get builds -l job=kmm-ci-sign -o name | sed 's|build/||')

Also applies to: 57-59


50-50: Consider extracting common retry patterns

Both scripts implement similar retry patterns for oc apply and resource waiting. Consider extracting these patterns into helper functions in a shared script to reduce duplication and ensure consistency.

Example shared functions:

# helpers.sh
wait_for_resource() {
    local namespace=$1
    local resource_type=$2
    local selector=$3
    local timeout=${4:-1m}
    
    if ! timeout "${timeout}" bash -c "until oc -n ${namespace} get ${resource_type} ${selector} &>/dev/null; do sleep 1; done"; then
        echo "ERROR: Timeout waiting for ${resource_type} in namespace ${namespace}"
        return 1
    fi
}

retry_apply() {
    local path=$1
    local timeout=${2:-1m}
    
    if ! timeout "${timeout}" bash -c "until oc apply -k ${path}; do sleep 3; done"; then
        echo "ERROR: Failed to apply resources at ${path} after ${timeout} of retries"
        return 1
    fi
}
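
For illustration, a hypothetical call site using these helpers (the sourcing path and label selector are assumptions carried over from the earlier suggestions, not existing script contents):

# Hypothetical usage in ci/prow/e2e-hub-spoke-incluster-build, assuming
# helpers.sh is shipped next to the CI scripts.
source "$(dirname "$0")/helpers.sh"

# Retry applying the kustomize directory for up to 1 minute.
retry_apply ci/e2e-hub 1m

# Wait for the build object to appear before following its logs.
wait_for_resource "${HUB_OPERATOR_NAMESPACE}" builds "-l job=kmm-ci-build" 1m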
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 36447e9 and 0835c06.

📒 Files selected for processing (2)
  • ci/prow/e2e-hub-spoke-incluster-build (1 hunks)
  • ci/prow/e2e-incluster-build (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (go)

The webhook services are not always available right after their
deployments become ready. This causes a race condition between the
services becoming ready to receive requests and the CRs being applied
to the cluster.

By retrying the CR application, we give the services the time they need
to become ready.

Signed-off-by: Yoni Bettan <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
ci/prow/e2e-hub-spoke-incluster-build (1)

Line range hint 1-54: Consider architectural improvements for better reliability.

The script has several areas where reliability could be improved:

  1. Inconsistent timeout durations (1m vs 3m vs 15m)
  2. Limited error handling and reporting
  3. No cleanup on failure

Consider these improvements:

  1. Define timeout constants at the top of the script:
readonly RESOURCE_TIMEOUT="5m"
readonly BUILD_TIMEOUT="10m"
readonly ACM_TIMEOUT="15m"
  2. Add a trap for cleanup:
cleanup() {
  # Add cleanup logic here
  echo "Cleaning up resources..."
}
trap cleanup EXIT
  3. Add an error-reporting function:
report_error() {
  local component=$1
  local message=$2
  echo "Error in ${component}: ${message}"
  # Add logging or metric collection
}
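
Put together, a minimal sketch of how these three suggestions might sit at the top of the script (the cleanup body and component name are placeholders, not the script's actual logic):

#!/usr/bin/env bash
set -euo pipefail

# Tunable timeout constants, defined in one place.
readonly RESOURCE_TIMEOUT="5m"
readonly BUILD_TIMEOUT="10m"
readonly ACM_TIMEOUT="15m"

cleanup() {
  # Placeholder: delete test namespaces, stop port-forwards, etc.
  echo "Cleaning up resources..."
}
trap cleanup EXIT

report_error() {
  local component=$1
  local message=$2
  echo "Error in ${component}: ${message}" >&2
}

if ! timeout "${RESOURCE_TIMEOUT}" bash -c 'until oc apply -k ci/e2e-hub; do sleep 3; done'; then
  report_error "e2e-hub" "resources not applied within ${RESOURCE_TIMEOUT}"
  exit 1
fi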
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0835c06 and 8156f88.

📒 Files selected for processing (2)
  • ci/prow/e2e-hub-spoke-incluster-build (1 hunks)
  • ci/prow/e2e-incluster-build (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ci/prow/e2e-incluster-build
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
ci/prow/e2e-hub-spoke-incluster-build (1)

54-54: Verify the timeout and retry intervals.

The retry mechanism is a good addition to handle race conditions with webhook services. However, consider the following aspects:

  1. The 1-minute timeout might be too short for larger deployments.
  2. The 3-second sleep interval between retries might lead to unnecessary load on the API server.

Consider this improved version with exponential backoff:

-timeout 1m bash -c 'until oc apply -k ci/e2e-hub; do sleep 3; done'
+timeout 5m bash -c '
+  retries=0
+  until oc apply -k ci/e2e-hub; do
+    retries=$((retries + 1))
+    if [ $retries -gt 10 ]; then
+      echo "Failed to apply resources after 10 retries"
+      exit 1
+    fi
+    sleep $((2 ** retries))
+  done
+'

Let's verify the resource application patterns in the codebase:

✅ Verification successful

The timeout and retry intervals are consistent with established patterns.

The current implementation using a 1-minute timeout and 3-second sleep interval aligns with similar resource application patterns throughout the codebase. While exponential backoff could be an improvement, the current approach is consistent with the project's conventions and sufficient for the use case.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for similar resource application patterns that might need the same retry mechanism
rg -A 2 "oc apply" --type bash

Length of output: 61


Script:

#!/bin/bash
# Search for oc apply patterns with context
rg -C 3 "oc apply"

# Search for timeout patterns
rg -C 2 "timeout.*bash"

# Search for retry/until patterns
rg -C 2 "until.*do"

Length of output: 42582

@ybettan
Collaborator Author

ybettan commented Jan 12, 2025

/retest

@ybettan
Collaborator Author

ybettan commented Jan 12, 2025

/retest

@ybettan
Collaborator Author

ybettan commented Jan 12, 2025

/retest

@TomerNewman
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 13, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit d80a1e4 into rh-ecosystem-edge:main Jan 13, 2025
20 checks passed
@ybettan ybettan deleted the retrying-cr-apply branch January 13, 2025 08:25
Development

Successfully merging this pull request may close these issues.

Cherry-picking error for a0f0d37bcb9b0a1a38bba1096b89aca92d42ce4e